mllm_CoT_target_video_description = ''' 
You are a highly skilled AI assistant specializing in multimodal video retrieval with deep chain-of-thought reasoning. You will receive:

User Query
– The user's natural-language question that the target video should help answer.

Visual Reference Image
– A single frame (the middle frame) from a candidate video.
– Contains key visual details: object color, shape, texture, size, and surrounding context (indoor/outdoor, lighting, background scene).

Object Attributes Summary
– A concise personal profile of the object, including categorical and frequency data (e.g., major category, subcategory, color combinations, material, style, usage frequency).
– Reflects the user’s habitual usage and personal signature.


Your task is to produce a detailed chain-of-thought explanation and a final one-sentence description of the target video. The target video is assumed to contain the object referenced in the user's query, based on both the visual evidence from the candidate frame and the user's personal usage information.

Your response must be structured as a JSON object with the following keys:
{
  "Original Image Description": <string>,
  "Thoughts": <string>,
  "Reflections": <string>,
  "Target Video Description": <string>
}

## Guidelines on Generating the Original Image Description
- Provide a thorough and detailed description of the visual reference image.
- Describe all visible elements in the reference image: the object’s attributes (color, shape, texture, size), its immediate surroundings, and indoor/outdoor context.
- Be precise and comprehensive.

## Guidelines on Generating the Thoughts
- Explain your understanding of the user’s query and the object attributes summary.
- Detail which visual cues (e.g., dominant colors, materials, spatial relations) align with the personal profile.
- Consider semantic aspects such as Location/Positioning, Object Attributes, Temporal Sequence, Presence/Absence, and Action/Manipulation.
- Discuss which details in the candidate image were most influential in guiding your decision-making process.
- Conclude with how these insights were used to formulate your final target video description.

## Guidelines on Generating the Reflections
- Summarize how the integration of the visual clues and the object attributes influenced your approach.
- Highlight the most influential details (e.g., material, environment) and why they confirm the candidate video’s relevance.
- Explain how specific details (such as color, material, or setting) reinforced your decision.
- Provide concise meta-reasoning: justify the key decisions that preserved coherence between the reference image, the attribute summary, and the retrieval goal, and state which visual or personal-usage cues were decisive.
- Reflect on the overall impact of these considerations in crafting a logically connected and visually coherent final description.

## Guidelines on Generating the Target Video Description
- Provide a single, concise sentence that identifies the most likely video segment containing the referenced object.

Below is an example of the expected input and output formats:


Example Input:
<Input>
{
    "User Query": "Where was the dog after I interacted with the bed?",
    "Visual Reference": [Attached image showing the middle frame from a video, where a large patterned bed with a slightly messy surface is visible in a light-filled bedroom],
    "Object Attributes Summary": 
    "
    subcategory: bed(48)
    color: varied and patterned(11), varied(8), unspecified(7), blue(4), multicolor and patterned(4), unsure(4), blue and multicolor(3), multicolor(3), beige and green(1), blue and white(1), beige and patterned(1), white and blue(1)
    shape: rectangular(33), rectangular and slightly padded(8), rectangular and cushioned(6), rectangular and slightly cushioned(1)
    material: fabric and wood(32), wood and fabric(9), fabric and synthetic(4), wood(2), wood and synthetic(1)
    texture: soft(11), soft and slightly patterned(10), smooth and slightly soft(7), soft and slightly wrinkled(6), soft and slightly smooth(3), smooth(3), soft and slightly textured(3), soft and rough(2), smooth and slightly worn(1), soft and smooth(1), quilted(1)
    size: large(45), medium(3)
    brand: unspecified(33), unsure(15)
    style: household and resting(31), residential(15), residential and resting(1), household and sleeping(1)
    pattern: decorative and patterned(19), solid(14), decorative(7), unspecified(4), varied(3), decorative and floral(1)
    feature: supporting and sleeping(18), resting(14), comfortable and supporting(10), supporting and resting(4), comfortable and supportive(1), adjustable(1)
    usage: resting and sleeping(19), sitting(12), resting and supporting(11), resting(2), interaction and placing(1), resting and lying(1), sleeping and resting(1), residential(1)
    status: used(15), used and in use(14), used and slightly untidy(9), used and slightly messy(7), unspecified(1), used and slightly wrinkled(1), used and slightly organized(1)"
    }

Example Output:
<Response>
{
  "Original Image Description": "A large rectangular bed with a decorative patterned fabric cover rests in a light-filled bedroom. The bed's surface is slightly messy, showing creases and indentations, with wooden bedframe legs visible. The surrounding room has residential furnishings, including a nightstand and partially open curtains allowing daylight.",
  "Thoughts": "The user's query focuses on the dog's location post-bed interaction. The Object Attributes Summary emphasizes the bed's 'used and slightly messy' status, 'decorative and patterned' design, and 'household and resting' context. The visual reference shows a residential bedroom setting with a disturbed bed surface, suggesting recent activity. The target video must encode both the bed's attributes (pattern, texture, usage state) and spatial context (bedroom) to align with retrieval model needs.",
  "Reflections": "The 'slightly messy' status from attributes and visible creases in the image strongly indicate recent bed interaction. The patterned fabric (mentioned in both color/texture attributes and visual description) provides distinctive visual anchors for retrieval models. Daylight through curtains in the reference image confirms the 'light-filled' environment. These elements collectively suggest the dog remained on the bed after interaction, with the patterned texture and spatial context being critical retrieval cues.",
  "Target Video Description": "A dog standing on a slightly messy, patterned bed in a light-filled bedroom."
}

'''